NVIDIA Tensor Core Programmability, Performance & Precision

Authors

Stefano Markidis

Steven Wei Der Chien

Erwin Laure

Ivy Bo Peng

Jeffrey S. Vetter

Abstract

The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called the Tensor Core, that performs one matrix-multiply-and-accumulate on 4×4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to programming NVIDIA Tensor Cores, their performance, and the precision loss due to computation in mixed precision. Currently, NVIDIA provides three different ways of programming matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply Accumulate (WMMA) API; CUTLASS, a templated library based on WMMA; and cuBLAS GEMM. After experimenting with different approaches, we found that NVIDIA Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100 GPU, seven and three times the performance in single and half precision, respectively. A WMMA implementation of batched GEMM reaches a performance of 4 Tflops/s. While precision loss due to matrix multiplication with half-precision input might be critical in many HPC applications, it can be considerably reduced at the cost of increased computation. Our results indicate that HPC applications using matrix multiplications can strongly benefit from using NVIDIA Tensor Cores.
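The trade-off between precision and extra computation can be illustrated with a small CPU-side sketch. It emulates a mixed-precision multiply-accumulate (inputs rounded to half precision, accumulation in single precision) on 4×4 matrices, then applies a residual-splitting refinement that uses three products instead of one. The refinement scheme shown here is an assumption made for illustration, since the abstract does not say which method the authors use. The function names (`to_half`, `matmul_mixed`, `matmul_refined`, `compare_errors`) are hypothetical, and nothing in the sketch touches the GPU or the WMMA API.

```python
import random
import struct


def to_half(x: float) -> float:
    """Round a float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_single(x: float) -> float:
    """Round a float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def matmul_mixed(a, b):
    """Emulated mixed precision: half-precision inputs, single-precision accumulation."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc = to_single(acc + to_half(a[i][p]) * to_half(b[p][j]))
            out[i][j] = acc
    return out


def matmul_exact(a, b):
    """Reference product in double precision."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def residual(a):
    """Half-precision residual: the part of each entry lost by rounding to half."""
    return [[to_half(x - to_half(x)) for x in row] for row in a]


def matmul_refined(a, b):
    """Assumed refinement: A*B ~ Ah*Bh + Ra*Bh + Ah*Rb (three mixed products)."""
    ra, rb = residual(a), residual(b)
    main = matmul_mixed(a, b)
    corr_a = matmul_mixed(ra, b)
    corr_b = matmul_mixed(a, rb)
    return [[to_single(to_single(main[i][j] + corr_a[i][j]) + corr_b[i][j])
             for j in range(len(main[0]))] for i in range(len(main))]


def max_abs_error(x, ref):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x[i][j] - ref[i][j])
               for i in range(len(ref)) for j in range(len(ref[0])))


def compare_errors(seed: int = 0, n: int = 4):
    """Return (plain error, refined error) for random n x n inputs in [-1, 1]."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    ref = matmul_exact(a, b)
    return (max_abs_error(matmul_mixed(a, b), ref),
            max_abs_error(matmul_refined(a, b), ref))


if __name__ == "__main__":
    plain, refined = compare_errors()
    print(f"max error, plain mixed precision: {plain:.3e}")
    print(f"max error, with refinement:       {refined:.3e}")
```

Under these assumptions the refined product costs three multiplications instead of one, which is the kind of "increased computation" the abstract refers to. On actual hardware each emulated product would map to a Tensor Core operation through WMMA, CUTLASS, or cuBLAS.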

Similar resources

Multicore Platforms for Scientific Computing: Cell BE and NVIDIA Tesla

Two multicore platforms are currently attracting enormous attention because of their tremendous potential in terms of sustained performance: the Cell Broadband Engine (Cell BE from now on) and the NVIDIA Tesla computing solutions. The former is a recent heterogeneous chip-multiprocessor (CMP) architecture jointly developed by IBM, Sony and Toshiba to offer very high performance...

Architectural Comparisons for a Quantum Monte Carlo Application

Recent technological advances have led to a number of emerging platforms such as multi-cores, reconfigurable computing, and graphics processing units. We present a comparative study of multi-cores, field-programmable gate arrays, and graphics processing units for a Quantum Monte Carlo chemistry application. The speedups of these implementations are measured relative to a multi-core implementati...

Multi-level Customisation Framework for Curve Based Monte Carlo Financial Simulations

One of the main challenges when accelerating financial applications using reconfigurable hardware is the management of design complexity. This paper proposes a multi-level customisation framework for automatic generation of complex yet highly efficient curve based financial Monte Carlo simulators on reconfigurable hardware. By identifying multiple levels of functional specialisations and the op...

Optimizing the multipole-to-local operator in the fast multipole method for graphical processing units

This paper presents a number of algorithms to run the fast multipole method (FMM) on NVIDIA CUDA-capable graphical processing units (GPUs) (Nvidia Corporation, Santa Clara, CA, USA). The FMM is a class of methods to compute pairwise interactions between N particles for a given error tolerance and with computational cost of O(N). The methods described in the paper are applicable to any FMMs in wh...

Computing Spectropolarimetric Signals on Accelerator Hardware Comparing the Cell BE and NVIDIA GPUs

Rapid calculation of the Voigt profile is critical for high performance in computational spectropolarimetric analysis. The Curtis and Osborne approximation to the Voigt function is arithmetically dense and embarrassingly parallel which makes it an intriguing candidate for exploiting accelerator technologies. We implement versions for the Cell Broadband Engine and for an NVIDIA GPU, and compare ...

Publication date: 2018
